Supervised learning for detection of duplicates in genomic sequence databases

Q Chen; J Zobel; X Zhang; K Verspoor

Journal article

PLOS ONE | Public Library of Science | Published: 2016

DOI: 10.1371/journal.pone.0159644

Open access

Abstract

Motivation: First identified as an issue in 1996, duplication in biological databases introduces redundancy and even leads to inconsistency when contradictory information appears. The amount of data makes purely manual de-duplication impractical, and existing automatic systems cannot detect duplicates as precisely as experts can. Supervised learning has the potential to address such problems by building automatic systems that learn from expert curation to detect duplicates precisely and efficiently. While machine learning is a mature approach in other duplicate detection contexts, it has seen only preliminary application in genomic sequence databases.

Results: We developed and evaluated a supe..